{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [ Chapter 13 - Semantic Search with Dense Vectors ] \n",
    "## Setting up the outdoors dataset\n",
    "\n",
    "We're going to use the Outdoors dataset for this chapter, and for a very important reason: the vocabulary and contexts in the outdoor question and answer __domain__ already have good coverage in the Transformer models we'll be using.\n",
    "\n",
    "This is because the datasets that were used to train the model include sources that are likely to have similar subject matter.  Wikipedia was used to train bert-base-uncased (https://huggingface.co/bert-base-uncased#training-data) and, surprise! wikipedia has a section specifically on outdoors content: https://en.wikipedia.org/wiki/Outdoor\n",
    "\n",
    "This is important, because if the words and their contexts haven't been seen before, the model will be less accurate.\n",
    "\n",
    "Also, who doesn't enjoy playing around with a new dataset?! Data is search nerd candy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import html\n",
    "import json\n",
    "\n",
    "from IPython.display import HTML, display\n",
    "\n",
    "from aips import get_engine\n",
    "engine = get_engine(\"solr\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 13.1\n",
    "\n",
    "### Creating our Collection and Indexing the documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into 'outdoors'...\n",
      "remote: Enumerating objects: 25, done.\u001b[K\n",
      "remote: Counting objects: 100% (25/25), done.\u001b[K\n",
      "remote: Compressing objects: 100% (24/24), done.\u001b[K\n",
      "remote: Total 25 (delta 0), reused 22 (delta 0), pack-reused 0 (from 0)\u001b[K\n",
      "Receiving objects: 100% (25/25), 491.39 MiB | 4.71 MiB/s, done.\n",
      "Updating files: 100% (23/23), done.\n",
      "Already up to date.\n",
      "README.md\n",
      "concepts.pickle\n",
      "._guesses.csv\n",
      "guesses.csv\n",
      "._guesses_all.json\n",
      "guesses_all.json\n",
      "outdoors_concepts.pickle\n",
      "outdoors_embeddings.pickle\n",
      "._outdoors_golden_answers.csv\n",
      "outdoors_golden_answers.csv\n",
      "._outdoors_golden_answers.xlsx\n",
      "outdoors_golden_answers.xlsx\n",
      "._outdoors_golden_answers_20210130.csv\n",
      "outdoors_golden_answers_20210130.csv\n",
      "outdoors_labels.pickle\n",
      "outdoors_question_answering_contexts.json\n",
      "outdoors_questionanswering_test_set.json\n",
      "outdoors_questionanswering_train_set.json\n",
      "._posts.csv\n",
      "posts.csv\n",
      "predicates.pickle\n",
      "pull_aips_dependency.py\n",
      "._question-answer-seed-contexts.csv\n",
      "question-answer-seed-contexts.csv\n",
      "question-answer-squad2-guesses.csv\n",
      "._roberta-base-squad2-outdoors\n",
      "roberta-base-squad2-outdoors/\n",
      "roberta-base-squad2-outdoors/._tokenizer_config.json\n",
      "roberta-base-squad2-outdoors/tokenizer_config.json\n",
      "roberta-base-squad2-outdoors/._special_tokens_map.json\n",
      "roberta-base-squad2-outdoors/special_tokens_map.json\n",
      "roberta-base-squad2-outdoors/._config.json\n",
      "roberta-base-squad2-outdoors/config.json\n",
      "roberta-base-squad2-outdoors/._merges.txt\n",
      "roberta-base-squad2-outdoors/merges.txt\n",
      "roberta-base-squad2-outdoors/._training_args.bin\n",
      "roberta-base-squad2-outdoors/training_args.bin\n",
      "roberta-base-squad2-outdoors/._pytorch_model.bin\n",
      "roberta-base-squad2-outdoors/pytorch_model.bin\n",
      "roberta-base-squad2-outdoors/._vocab.json\n",
      "roberta-base-squad2-outdoors/vocab.json\n"
     ]
    }
   ],
   "source": [
    "#outdoors\n",
    "![ ! -d 'outdoors' ] && git clone --depth=1 https://github.com/ai-powered-search/outdoors.git\n",
    "! cd outdoors && git pull\n",
    "! cd outdoors && cat outdoors.tgz.part* > outdoors.tgz\n",
    "! cd outdoors && mkdir -p '../data/outdoors/' && tar -xvf outdoors.tgz -C '../data/outdoors/'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wiping \"outdoors\" collection\n",
      "Creating \"outdoors\" collection\n",
      "Status: Success\n",
      "Successfully written 18456 documents\n",
      "root\n",
      " |-- id: integer (nullable = true)\n",
      " |-- accepted_answer_id: integer (nullable = true)\n",
      " |-- parent_id: integer (nullable = true)\n",
      " |-- creation_date: timestamp (nullable = true)\n",
      " |-- score: integer (nullable = true)\n",
      " |-- view_count: integer (nullable = false)\n",
      " |-- body: string (nullable = true)\n",
      " |-- owner_user_id: string (nullable = true)\n",
      " |-- title: string (nullable = true)\n",
      " |-- tags: array (nullable = true)\n",
      " |    |-- element: string (containsNull = true)\n",
      " |-- answer_count: integer (nullable = true)\n",
      " |-- post_type: string (nullable = true)\n",
      " |-- url: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from aips.data_loaders.outdoors import load_dataframe\n",
    "\n",
    "outdoors_collection = engine.create_collection(\"outdoors\")\n",
    "outdoors_dataframe = load_dataframe(\"data/outdoors/posts.csv\")\n",
    "outdoors_collection.write(outdoors_dataframe)\n",
    "outdoors_dataframe.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 13.2\n",
    "\n",
    "### Exploring data for a question regarding `climbing knots`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_posts_for_question(query):\n",
    "    request = {\"query\": query,\n",
    "               \"limit\": 1, \n",
    "               \"query_fields\": [\"title\", \"body\"],\n",
    "               \"return_fields\": [\"id\"],\n",
    "               \"filters\": [(\"post_type\", \"question\"),\n",
    "                           (\"accepted_answer_id\", \"*\")]}\n",
    "    id = outdoors_collection.search(**request)[\"docs\"][0][\"id\"]\n",
    "    request = {\"query\": id,\n",
    "               \"query_fields\": [\"id\", \"parent_id\"],\n",
    "               \"limit\": 3,\n",
    "               \"return_fields\": [\"id\", \"post_type\", \"title\", \"body\", \"parent_id\", \"accepted_answer_id\"]}\n",
    "    return outdoors_collection.search(**request)[\"docs\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'id': '18825',\n",
       "  'accepted_answer_id': 18826,\n",
       "  'body': \"If I wanted to learn how to tie certain knots, or learn about new knots and what they're used for, what are some good resources to look up?\",\n",
       "  'title': \"What's a good resource for learning to tie knots for climbing?\",\n",
       "  'post_type': 'question'},\n",
       " {'id': '24440',\n",
       "  'parent_id': 18825,\n",
       "  'body': \"Knots and Ropes for Climbers by Duane Raleigh is a fantastic illustrated resource tailored specifically to climbers. The ABoK is great, but a but beyond the pale of what the average rock climber needs to know. The linked book is a fabulous climber's only reference.\",\n",
       "  'post_type': 'answer'},\n",
       " {'id': '18826',\n",
       "  'parent_id': 18825,\n",
       "  'body': \"Animated Knots By Grog Arguably the best resource online for knot tying is Animated Knots by Grog , it's used by virtually every avid knot tyer I've known. They have excellent step-by-step animations for tying hundreds of common knots for rock climbing, sailing, rescue work, fishing, etc. There are also video tutorials for each knot, and they also have an amazing mobile app that you take with you for times when you're out on an adventure and forget how to tie a knot (which I've done fishing on a number of occations).\",\n",
       "  'post_type': 'answer'}]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_posts_for_question(\"climbing knots\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 13.3\n",
    "\n",
    "### Querying our collection with a noun phrase"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def display_questions(query, response, verbose=False):\n",
    "    print(f\"Query {query}:\\n\")\n",
    "    if verbose:\n",
    "        highlights = [response[\"highlighting\"][k] for k in response[\"highlighting\"].keys()]\n",
    "    print(\"Ranked Questions:\")\n",
    "    for i, q in enumerate(response[\"docs\"]):\n",
    "        if verbose:\n",
    "            print(json.dumps(q, indent=\"  \"))\n",
    "        if \"title\" in q.keys():\n",
    "            id = f'<a href=\"{q[\"url\"]}\">{q[\"id\"]}</a>'\n",
    "            display(HTML(f'<strong>Question {id}: </strong>{q[\"title\"]}'))\n",
    "        if verbose:\n",
    "            display(HTML(\"<strong>Body:</strong>\" + html.unescape(str(highlights[i][\"body\"][0]))))\n",
    "            display(HTML(\"<hr>\"))\n",
    "\n",
    "def search_questions(query, verbose=False):\n",
    "    request = {\"query\": query,\n",
    "               \"query_fields\": [\"title\", \"body\"],\n",
    "               \"limit\": 5,\n",
    "               \"return_fields\": [\"id\", \"url\", \"post_type\", \"title\",\n",
    "                                 \"body\", \"accepted_answer_id\", \"score\"],\n",
    "               \"filters\": [(\"post_type\", \"question\")],\n",
    "               \"order_by\": [(\"score\", \"desc\"), (\"title\", \"asc\")]}\n",
    "    response = outdoors_collection.search(**request)\n",
    "    display_questions(query, response, verbose)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query climbing knots:\n",
      "\n",
      "Ranked Questions:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/21855\">21855</a>: </strong>What are the four climbing knots used by Jim Bridwell?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/18825\">18825</a>: </strong>What's a good resource for learning to tie knots for climbing?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/18814\">18814</a>: </strong>How to tie a figure eight on a bight?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/9183\">9183</a>: </strong>Can rock climbers easily transition to canyoning?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/22477\">22477</a>: </strong>Tradeoffs between different stopper knots"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "search_questions(\"climbing knots\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Listing 13.4\n",
    "\n",
    "### Querying our collection with a question"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query What is DEET?:\n",
      "\n",
      "Ranked Questions:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/20403\">20403</a>: </strong>What is bushcrafting?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/20977\">20977</a>: </strong>What is \"catskiing\"?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/1660\">1660</a>: </strong>What is Geocaching?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/17374\">17374</a>: </strong>What is a tent skirt and what is its purpose?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/20323\">20323</a>: </strong>What is a kassakåta?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "search_questions(\"What is DEET?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query snow camping safety:\n",
      "\n",
      "Ranked Questions:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/1966\">1966</a>: </strong>Best approach to camping in snow?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/20725\">20725</a>: </strong>How do people carry camping propane in a car when traveling?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/18942\">18942</a>: </strong>Resources for camping in Georgia (country)"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/13209\">13209</a>: </strong>Any tips to prevent theft while tent camping alone in a caravan park?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong>Question <a href=\"https://outdoors.stackexchange.com/questions/8700\">8700</a>: </strong>What are some simple tasks to teach knife safety?"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "search_questions(\"snow camping safety\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Up next: [Introduction to Transformers](2.introduction-to-transformers.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
